-
-
- DATA COMPRESSION ALGORITHMS OF ARC, PKZIP, AND LHARC
- ====================================================
-
-
- Much of the typical modem user's online time is spent uploading or
- downloading files from BBSes, online services like CompuServe or
- GEnie, or information networks like Usenet or the Internet. Because
- file transfers take a lot of time and usually cost a considerable
- amount of money, the need to shorten them has always been felt. One
- innovation in this field has been the development of advanced
- algorithms for compacting, or compressing, data so that it takes up
- much less space, and for packing multiple files into one archive, or
- data file, so that many files can be sent at once.
-
- The current technology, an offspring of data encryption methods used
- in World War II, reduces the time it takes to transfer a file through
- a modem by reducing the size of the data itself. Given the
- proliferation of data compression methods (ARC, PKZIP, ZOO, SIT, and
- LHARC, to name a few), each trying to provide the most efficient
- compression, the topic has always been a controversial one.
-
- Haruhiko Okumura provided a great source of knowledge about data
- compression algorithms by writing this essay, which describes some of
- the effort involved in creating a data compression standard. Except
- for changes in formatting and presentation, and various notes added
- to this text to provide more information on certain subjects, the
- content of Haruhiko Okumura's text is unchanged.
-
-
- Introduction: History of LHARC's Forefathers
- ---------------------------------------------
-
- In the spring of 1988, I wrote a very simple data compression program
- named LZSS in C language, and uploaded it to the Science SIG (forum) of
- PC-VAN, Japan's biggest personal computer network. That program was based
- on Storer and Szymanski's slightly modified version of one of Lempel and
- Ziv's algorithms. Despite its simplicity, for most files its compression
- outperformed the archivers then widely used.
-
- Kazuhiko Miki rewrote my LZSS in Turbo Pascal and assembly language,
- and soon made it evolve into a complete archiver, which he named LARC. The
- first versions of LZSS and LARC were rather slow. So I rewrote my LZSS
- using a binary tree, and so did Miki. Although LARC's encoding was slower
- than the fastest archiver available, its decoding was quite fast, and its
- algorithm was so simple that even self-extracting files (compressed files
- plus decoder) it created were usually smaller than non-self-extracting
- files from other archivers.
-
- Soon many hobby programmers joined the archiver project at the forum.
- A great many suggestions were made, and LARC was revised again and
- again. By the summer of 1988, LARC's speed and compression had
- improved so much that LARC-compressed programs were beginning to be
- uploaded in many forums of PC-VAN and other networks.
-
- In that summer I wrote another program, LZARI, which combined the LZSS
- algorithm with adaptive arithmetic compression. Although it was slower
- than LZSS, its compression performance was amazing. Miki, the author of
- LARC, uploaded LZARI to NIFTY-Serve, another big information network in
- Japan. In NIFTY-Serve, Haruyasu Yoshizaki replaced LZARI's adaptive
- arithmetic coding with a version of adaptive Huffman coding to increase
- speed. Based on this algorithm, which he called LZHUF, he developed yet
- another archiver, LHarc.
-
-
- Data Compression Algorithms, Lempel-Ziv, and ARC.TTP
- ----------------------------------------------------
-
- In what follows, I will review several of these algorithms and
- supply simplified code in the C language.
-
-
- 1. Run-length coding
-
- Replacing several (usually 8 or 4) "space" characters by one "tab"
- character is a very primitive method of data compression. Another
- simple method is run-length coding, which encodes the message
- "AAABBBBAACCCC" as "3A4B2A4C", for example.
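-
- As a minimal sketch of the idea, the following C fragment encodes a
- string in the "3A4B2A4C" style shown above. It is illustrative only:
- a real coder would also have to handle runs longer than nine
- characters and digits appearing in the input.
-
-     #include <stdio.h>
-
-     /* Write each run of the input as "<count><char>". */
-     void rle_encode(const char *in, FILE *out)
-     {
-         while (*in) {
-             char c = *in;
-             int n = 0;
-             while (*in == c) { n++; in++; }   /* measure the run */
-             fprintf(out, "%d%c", n, c);
-         }
-     }
-
-     int main(void)
-     {
-         rle_encode("AAABBBBAACCCC", stdout);  /* prints 3A4B2A4C */
-         putchar('\n');
-         return 0;
-     }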
-
-
- 2. LZSS coding
-
- This scheme was initiated by Ziv and Lempel [1]. A slightly modified
- version was described by Storer and Szymanski [2]. An implementation
- using a binary tree was proposed by Bell [3]. The algorithm is quite
- simple:
- Keep a ring buffer, which initially contains "space" characters only.
- Read several letters from the file to the buffer. Then search the buffer
- for the longest string that matches the letters just read, and send its
- length and position in the buffer.
-
- If the buffer size is 4096 bytes, the position can be encoded in 12
- bits. If we represent the match length in four bits, the <position,
- length> pair is two bytes long. If the longest match is no more than two
- characters, then we send just one character without encoding, and restart
- the process with the next letter. We must send one extra bit each time to
- tell the decoder whether we are sending a <position, length> pair or an
- unencoded character.
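-
- The following C sketch illustrates the scheme with the parameters
- just described (a 4096-byte window and match lengths of 3 through
- 18). For clarity, the space-filled ring buffer and binary-tree
- search are replaced by a plain brute-force scan, and readable tokens
- are printed instead of packed flag bits; the token format is my own,
- not part of LZSS itself.
-
-     #include <stdio.h>
-     #include <string.h>
-
-     #define WINDOW    4096   /* positions fit in 12 bits */
-     #define MIN_MATCH 3      /* shorter matches go out as characters */
-     #define MAX_MATCH 18     /* lengths 3..18 fit in 4 bits */
-
-     /* Emit "P <position> <length>" pairs and "C <char>" literals. */
-     void lzss_encode(const unsigned char *buf, int len)
-     {
-         int i = 0;
-         while (i < len) {
-             int best_len = 0, best_pos = 0;
-             int start = i > WINDOW ? i - WINDOW : 0;
-             for (int j = start; j < i; j++) {   /* brute-force search */
-                 int k = 0;
-                 while (k < MAX_MATCH && i + k < len
-                        && buf[j + k] == buf[i + k])
-                     k++;
-                 if (k > best_len) { best_len = k; best_pos = j; }
-             }
-             if (best_len >= MIN_MATCH) {        /* send the pair */
-                 printf("P %d %d\n", best_pos, best_len);
-                 i += best_len;
-             } else {                            /* send one character */
-                 printf("C %c\n", buf[i]);
-                 i++;
-             }
-         }
-     }
-
-     int main(void)
-     {
-         const char *s = "ABABABABABABAB";
-         lzss_encode((const unsigned char *)s, (int)strlen(s));
-         return 0;
-     }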
-
-
- 3. LZW coding
-
- This scheme was devised by Ziv and Lempel [4], and modified by Welch
- [5]. LZW coding has been adopted by most of the existing archivers,
- such as ARC and PKZIP. The algorithm can be made relatively fast, and is
- suitable for hardware implementation as well. A Pascal program for this
- algorithm is given in Storer's book [6].
-
-
- The algorithm can be outlined as follows: Prepare a table that can
- contain several thousand items. Initially register in its 0th through
- 255th positions the usual 256 characters. Read several letters from the
- file to be encoded, and search the table for the longest match. Suppose
- the longest match is given by the string "ABC". Send the position of
- "ABC" in the table. Read the next character from the file. If it is "D",
- then register a new string "ABCD" in the table, and restart the process
- with the letter "D". If the table becomes full, discard the oldest item
- or, preferably, the least used.
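-
- Here is a much simplified C sketch of the outline above. The table
- is searched linearly, registration simply stops when the table fills
- (rather than discarding old items), and codes are printed as
- integers instead of being packed into bits.
-
-     #include <stdio.h>
-
-     #define TABLE_SIZE 4096
-
-     /* Each table entry stores a string as (prefix code, last char). */
-     static int prefix[TABLE_SIZE];
-     static int append[TABLE_SIZE];
-     static int next_code = 256;      /* 0..255 are the 256 characters */
-
-     /* Return the code of "string(code) + c", or -1 if unregistered. */
-     static int find(int code, int c)
-     {
-         for (int i = 256; i < next_code; i++)
-             if (prefix[i] == code && append[i] == c)
-                 return i;
-         return -1;
-     }
-
-     void lzw_encode(const unsigned char *s)
-     {
-         int code = *s++;                   /* start with one character */
-         for (; *s; s++) {
-             int hit = find(code, *s);
-             if (hit >= 0) {                /* the longest match grows */
-                 code = hit;
-             } else {
-                 printf("%d ", code);       /* send the longest match */
-                 if (next_code < TABLE_SIZE) {
-                     prefix[next_code] = code;  /* register match+char */
-                     append[next_code] = *s;
-                     next_code++;
-                 }
-                 code = *s;                 /* restart with this letter */
-             }
-         }
-         printf("%d\n", code);
-     }
-
-     int main(void)
-     {
-         lzw_encode((const unsigned char *)"ABABABA");  /* 65 66 256 258 */
-         return 0;
-     }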
-
-
- 4. Huffman coding
-
- Classical Huffman coding was invented by Huffman [7]. A fairly
- readable account is given in Sedgewick [8]. Suppose the text to be
- encoded is "ABABACA", with four A's, two B's, and a C. We represent this
- situation as follows:
-
-            4    2    1
-            |    |    |
-            A    B    C
-
- Combine the least frequent two characters into one, resulting in the
- new frequency 2 + 1 = 3:
-
-            4      3
-            |     / \
-            A    B   C
-
- Repeat the above step until all the characters are combined into a
- single tree:
-
-              7
-            /   \
-           /     3
-          /     / \
-         A     B   C
-
- Start at the top ("root") of this encoding tree, and travel to the
- character you want to encode. If you go left, send a "0"; otherwise send
- a "1". Thus, "A" is encoded by "0", "B" by "10", "C" by "11". Altogether,
- "ABABACA" will be encoded into ten bits, "0100100110". To decode this
- code, the decoder must know the encoding tree, which must be sent
- separately.
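-
- As a sketch, the following C program builds the tree for the example
- above and prints each character's code. A linear scan for the two
- least frequent live nodes stands in for the priority queue a real
- implementation would use; with the branch order chosen here it
- reproduces A = "0", B = "10", C = "11".
-
-     #include <stdio.h>
-
-     #define NSYM 3                        /* A, B, C from the example */
-
-     /* Leaves are nodes 0..NSYM-1; combined nodes come after them. */
-     static int  freq[2 * NSYM] = {4, 2, 1};
-     static int  left[2 * NSYM], right[2 * NSYM];
-     static char name[NSYM]     = {'A', 'B', 'C'};
-
-     /* Index of the live node with the smallest frequency, or -1. */
-     static int smallest(const int *alive, int nodes)
-     {
-         int best = -1;
-         for (int i = 0; i < nodes; i++)
-             if (alive[i] && (best < 0 || freq[i] < freq[best]))
-                 best = i;
-         return best;
-     }
-
-     /* Walk the tree: going left appends "0", going right "1". */
-     static void show(int node, char *code, int depth)
-     {
-         if (node < NSYM) {                /* a leaf: print its code */
-             code[depth] = '\0';
-             printf("%c = %s\n", name[node], code);
-             return;
-         }
-         code[depth] = '0'; show(left[node],  code, depth + 1);
-         code[depth] = '1'; show(right[node], code, depth + 1);
-     }
-
-     int main(void)
-     {
-         int alive[2 * NSYM] = {1, 1, 1};
-         int nodes = NSYM;
-         for (;;) {                        /* repeat the combining step */
-             int a = smallest(alive, nodes);
-             alive[a] = 0;
-             int b = smallest(alive, nodes);
-             if (b < 0) { alive[a] = 1; break; }  /* a is the root */
-             alive[b] = 0;
-             left[nodes]  = b;             /* more frequent node left */
-             right[nodes] = a;
-             freq[nodes]  = freq[a] + freq[b];
-             alive[nodes] = 1;
-             nodes++;
-         }
-         char code[2 * NSYM];
-         show(nodes - 1, code, 0);
-         return 0;
-     }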
-
- A modification of this classical Huffman coding is adaptive, or
- dynamic, Huffman coding. See, e.g., Gallager [9]. In this method,
- the encoder and the decoder process the first letter of the text as
- if the frequency of each character in the file were, say, one. After
- the first letter has been processed, both parties increment the
- frequency of that character by one. For example, if the first letter
- is 'C', then freq['C'] becomes two, whereas all the other
- frequencies are still one. Both parties then modify the encoding
- tree accordingly; the second letter is then encoded and decoded in
- the same way, and so on.
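-
- A naive C sketch of this symmetric update follows: both sides start
- every count at one and bump it after each processed character. The
- rebuild step is a hypothetical stand-in; real adaptive Huffman
- coders (e.g. Gallager's scheme) update the tree incrementally
- instead of recomputing it from scratch.
-
-     #include <stdio.h>
-
-     static int freq[256];
-
-     /* Both encoder and decoder start from the same model. */
-     void init_model(void)
-     {
-         for (int i = 0; i < 256; i++)
-             freq[i] = 1;              /* every character counts one */
-     }
-
-     /* Both sides call this after each processed letter, keeping
-        their trees identical with no side information transmitted. */
-     void update_model(unsigned char c)
-     {
-         freq[c]++;
-         /* rebuild_tree(freq);  hypothetical: recompute the code */
-     }
-
-     int main(void)
-     {
-         init_model();
-         update_model('C');
-         printf("freq['C'] = %d, freq['A'] = %d\n",
-                freq['C'], freq['A']);   /* prints 2 and 1 */
-         return 0;
-     }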
-
-
- 5. Arithmetic coding
-
- The original concept of arithmetic coding was proposed by P. Elias.
- An implementation in the C language is described by Witten and
- others [10].
-
- Although Huffman coding is optimal if each character must be encoded
- into a fixed (integer) number of bits, arithmetic coding wins if no
- such restriction is made.
-
- As an example we shall encode "AABA" using arithmetic coding. For
- simplicity, suppose we know beforehand that the probabilities for "A"
- and "B" to appear in the text are 3/4 and 1/4, respectively.
-
- Initially, consider an interval:
-
- 0 <= x < 1.
-
- Since the first character is "A" whose probability is 3/4, we shrink
- the interval to the lower 3/4:
-
- 0 <= x < 3/4.
-
- The next character is "A" again, so we take the lower 3/4:
-
- 0 <= x < 9/16.
-
- Next comes "B" whose probability is 1/4, so we take the upper 1/4:
-
- 27/64 <= x < 9/16,
-
- because "B" is the second element in our alphabet, {A, B}. The last
- character is "A" and the interval is
-
- 27/64 <= x < 135/256,
-
- which can be written in binary notation as
-
- 0.011011 <= x < 0.10000111.
-
- Choose from this interval any number that can be represented in the
- fewest bits, say 0.1, and send the bits to the right of "0."; in this
- case we send only one bit, "1". Thus we have encoded four letters
- into one bit! With Huffman coding, four letters could not be encoded
- into fewer than four bits.
-
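- As a small illustration, this C fragment performs the interval
- narrowing above with floating-point numbers. It assumes the fixed
- model P(A) = 3/4, P(B) = 1/4; a real arithmetic coder works with
- integer arithmetic and emits bits incrementally as the interval
- narrows.
-
-     #include <stdio.h>
-
-     int main(void)
-     {
-         double lo = 0.0, hi = 1.0;
-         const char *msg = "AABA";
-         for (const char *p = msg; *p; p++) {
-             double range = hi - lo;
-             if (*p == 'A')
-                 hi = lo + range * 0.75;   /* take the lower 3/4 */
-             else
-                 lo = lo + range * 0.75;   /* take the upper 1/4 */
-             printf("%c: %.8f <= x < %.8f\n", *p, lo, hi);
-         }
-         /* The final interval is [27/64, 135/256); 0.5 (binary 0.1)
-            lies inside it, so a single "1" bit encodes the message. */
-         return 0;
-     }
-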
- To decode the code "1", we just reverse the process: First, we supply
- the "0." to the right of the received code "1", resulting in "0.1" in
- binary notation, or 1/2. Since this number is in the first 3/4 of the
- initial interval 0 <= x < 1, the first character must be "A". Shrink
- the interval to the lower 3/4. In this new interval, the number 1/2
- again lies in the lower 3/4 part, so the second character is "A", and
- so on. The number of letters in the original file must be sent
- separately (or a special 'EOF' character must be appended at the end
- of the file).
-
- The algorithm described above requires that both the sender and
- receiver know the probability distribution for the characters. The
- adaptive version of the algorithm removes this restriction by first
- supposing uniform or any agreed-upon distribution of characters that
- approximates the true distribution, and then updating the distribution
- after each character is sent and received.
-
-
- 6. LZARI
-
- In each step the LZSS algorithm sends either a character or a
- <position, length> pair. Among these, perhaps character "e" appears more
- frequently than "x", and a <position, length> pair of length 3 might be
- commoner than one of length 18, say. Thus, if we encode the more frequent
- in fewer bits and the less frequent in more bits, the total length of the
- encoded text will be diminished. This consideration suggests that we use
- Huffman or arithmetic coding, preferably of an adaptive kind, along with
- LZSS. This is easier said than done, because there are many possible
- <position, length> combinations. Adaptive compression must keep running
- statistics of frequency distribution. Too many items make statistics
- unreliable.
-
-
- LZARI, and the Creation of a Data Compression Program
- -----------------------------------------------------
-
- What follows is not even an approximate solution to the problem posed
- above, but anyway this was what I did in the summer of 1988.
-
- I extended the character set from 256 to three hundred or so in
- size, and let characters 0 through 255 be the usual 8-bit characters,
- whereas character 253 + n indicates that what follows is the position
- of a string of length n, where n = 3, 4, .... This extended set of
- characters is encoded with adaptive arithmetic compression.
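-
- A small C sketch of this mapping follows. The emitters arith_encode()
- and encode_position() are hypothetical stand-ins for the adaptive
- arithmetic coder and for the fixed position distribution discussed
- below.
-
-     #include <stdio.h>
-
-     #define THRESHOLD 3     /* shorter matches are sent as literals */
-
-     /* Stand-ins: print instead of actually coding. */
-     static void arith_encode(int symbol) { printf("sym %d\n", symbol); }
-     static void encode_position(int pos) { printf("pos %d\n", pos); }
-
-     /* A literal is its own symbol, 0..255. */
-     void emit_literal(unsigned char c)
-     {
-         arith_encode(c);
-     }
-
-     /* A match of length n (n >= THRESHOLD) becomes symbol 253 + n,
-        so length 3 is 256, length 4 is 257, and so on; the position
-        follows, coded separately. */
-     void emit_match(int pos, int len)
-     {
-         arith_encode(253 + len);
-         encode_position(pos);
-     }
-
-     int main(void)
-     {
-         emit_literal('e');      /* a plain character */
-         emit_match(1234, 5);    /* 5 bytes copied from position 1234 */
-         return 0;
-     }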
-
- I also observed that longest-match strings tend to be the ones that
- were read relatively recently. Therefore, recent positions should be
- encoded in fewer bits. Since 4096 positions are too many to encode
- adaptively, I fixed the probability distribution of the positions "by
- hand". The distribution function given in the accompanying LZARI.C is
- rather tentative; it is not based on thorough experimentation. In
- retrospect, I could have encoded, say, the most significant 6 bits
- adaptively, or perhaps adapted the parameters of the distribution
- function to the running statistics by some more ingenious method.
-
- At any rate, the present version of LZARI treats the positions rather
- separately, so that the overall compression is by no means optimal.
- Furthermore, the string length threshold above which strings are coded
- into <position, length> pairs is fixed, but logically its value must
- change according to the length of the <position, length> pair we would
- get.
-
-
- 7. LZHUF
-
- LZHUF, the algorithm of Haruyasu Yoshizaki's archiver LHarc, replaces
- LZARI's adaptive arithmetic coding with adaptive Huffman. LZHUF encodes
- the most significant 6 bits of the position in its 4096-byte buffer by
- table lookup. More recent, and hence more probable, positions are
- coded in fewer bits. On the other hand, the remaining 6 bits are
- sent verbatim.
-
- Because Huffman coding encodes each letter into a fixed number of
- bits, table lookup can be easily implemented. Though theoretically
- Huffman coding cannot outperform arithmetic compression, the
- difference is very slight, and LZHUF is fairly fast.
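-
- A C sketch of this position split follows; huff_encode() and
- put_bits() are hypothetical stand-ins for LZHUF's actual coding and
- bit-output routines.
-
-     #include <stdio.h>
-
-     /* Stand-ins: print instead of actually writing coded bits. */
-     static void huff_encode(int symbol)   /* table-coded, 0..63 */
-     {
-         printf("coded high bits: %d\n", symbol);
-     }
-
-     static void put_bits(int value, int nbits)   /* raw bits */
-     {
-         printf("raw low bits:    %d (%d bits)\n", value, nbits);
-     }
-
-     /* Split a 12-bit buffer position: the most significant 6 bits
-        are coded (recent positions get shorter codes), and the low
-        6 bits go out verbatim. */
-     void encode_position(int pos)   /* 0 <= pos < 4096 */
-     {
-         huff_encode(pos >> 6);
-         put_bits(pos & 0x3F, 6);
-     }
-
-     int main(void)
-     {
-         encode_position(1234);      /* high bits 19, low bits 18 */
-         return 0;
-     }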
-
-
- References:
- -----------
-
- [1] J. Ziv and A. Lempel, IEEE Trans. IT-23, 337-343 (1977).
-
- [2] J. A. Storer and T. G. Szymanski, J. ACM, 29, 928-951
- (1982).
-
- [3] T. C. Bell, IEEE Trans. COM-34, 1176-1182 (1986).
-
- [4] J. Ziv and A. Lempel, IEEE Trans. IT-24, 530-536 (1978).
-
- [5] T. A. Welch, Computer, 17, No.6, 8-19 (1984).
-
- [6] J. A. Storer, Data Compression: Methods and Theory
- (Computer Science Press, 1988).
-
- [7] D. A. Huffman, Proc. IRE 40, 1098-1101 (1952).
-
- [8] R. Sedgewick, Algorithms, 2nd ed. (Addison-Wesley, 1988).
-
- [9] R. G. Gallager, IEEE Trans. IT-24, 668-674 (1978).
-
- [10] I. H. Witten, R. M. Neal, and J. G. Cleary, Commun. ACM
- 30, 520-540 (1987).
- ________________________________________________________
-
- Excerpt from ST Report of 15 March 1991
-
- 29 March 1991 - Ghlenn Willard 6929
-
-